Skip to content

Add OpenTelemetry distributed tracing across all Salt hops#69145

Open
dwoz wants to merge 7 commits into
saltstack:3008.xfrom
dwoz:tracing-otel
Open

Add OpenTelemetry distributed tracing across all Salt hops#69145
dwoz wants to merge 7 commits into
saltstack:3008.xfrom
dwoz:tracing-otel

Conversation

@dwoz
Copy link
Copy Markdown
Contributor

@dwoz dwoz commented May 14, 2026

Instrument every inter-process hop (network and IPC) with W3C TraceContext-propagated spans exported via OTLP. Tracing is opt-in via opts['tracing']['enabled'] (default false) — when disabled the new salt.utils.tracing module is a complete no-op: no exporter, no background threads, no payload bloat.

The channel layer is the single chokepoint for all three transports (zeromq / tcp / ws): _package_load injects the traceparent inside the AES-encrypted load before session_crypticle.dumps, and ReqServerChannel.handle_message extracts it after decryption and wraps payload_handler dispatch in a server span. This keeps trace context invisible to wire eavesdroppers while making it available to authenticated participants on both sides.

Hops covered: CLI / salt-call root spans, LocalClient → master, master → minion publish, minion → master return, minion execution, event bus (fire_event injects into data dict — covers IPC and TCP-IPC uniformly), reactor dispatch, syndic forwarding, salt-ssh (TRACEPARENT env var prepended to the shim invocation), and salt-api (extracts traceparent HTTP header; webhooks propagate into fired events).

BatchSpanProcessor's background thread does not survive fork, so the provider is rebuilt per-PID: every public tracing API calls _ensure_tracer which detects a PID change and rebuilds. Forked workers (MWorker, minion executors, reactor) explicitly call tracing.configure(opts) at the top of their entry point.

Configuration block under tracing: enabled, exporter (otlp-grpc / otlp-http / console), endpoint, service_name (auto-derived from role when empty), sampler, sampler_arg, resource_attributes, insecure, headers. Defaults added to both DEFAULT_MASTER_OPTS and DEFAULT_MINION_OPTS.

Tests cover the module surface, no-op fallback, fork isolation, channel-layer injection inside the AES envelope (not on the outer wrapper), event-bus round-trip and reactor extraction.

twangboy
twangboy previously approved these changes May 14, 2026
Daniel A. Wozniak added 7 commits May 16, 2026 01:09
Instrument every inter-process hop (network and IPC) with W3C
TraceContext-propagated spans exported via OTLP.  Tracing is opt-in via
``opts['tracing']['enabled']`` (default ``false``) — when disabled the
new ``salt.utils.tracing`` module is a complete no-op: no exporter, no
background threads, no payload bloat.

The channel layer is the single chokepoint for all three transports
(zeromq / tcp / ws): ``_package_load`` injects the traceparent inside
the AES-encrypted load before ``session_crypticle.dumps``, and
``ReqServerChannel.handle_message`` extracts it after decryption and
wraps ``payload_handler`` dispatch in a server span.  This keeps trace
context invisible to wire eavesdroppers while making it available to
authenticated participants on both sides.

Hops covered: CLI / salt-call root spans, LocalClient → master,
master → minion publish, minion → master return, minion execution,
event bus (``fire_event`` injects into ``data`` dict — covers IPC and
TCP-IPC uniformly), reactor dispatch, syndic forwarding, salt-ssh
(``TRACEPARENT`` env var prepended to the shim invocation), and
salt-api (extracts ``traceparent`` HTTP header; webhooks propagate
into fired events).

``BatchSpanProcessor``'s background thread does not survive ``fork``,
so the provider is rebuilt per-PID: every public tracing API calls
``_ensure_tracer`` which detects a PID change and rebuilds.  Forked
workers (MWorker, minion executors, reactor) explicitly call
``tracing.configure(opts)`` at the top of their entry point.

Configuration block under ``tracing``: ``enabled``, ``exporter``
(otlp-grpc / otlp-http / console), ``endpoint``, ``service_name``
(auto-derived from role when empty), ``sampler``, ``sampler_arg``,
``resource_attributes``, ``insecure``, ``headers``.  Defaults added
to both ``DEFAULT_MASTER_OPTS`` and ``DEFAULT_MINION_OPTS``.

Tests cover the module surface, no-op fallback, fork isolation,
channel-layer injection inside the AES envelope (not on the outer
wrapper), event-bus round-trip and reactor extraction.
The Generate MAN Pages step builds sphinx with ``-W`` (warnings as
errors).  The new ``doc/topics/tracing/index.rst`` triggered two:

* The title overline/underline were one character shorter than the title
  text, which sphinx reports as ``WARNING: Title overline too short``.
  Extend both delimiter lines to the full length of the title.

* The page wasn't referenced in any toctree, producing
  ``WARNING: document isn't included in any toctree``.  Add it to the
  main ``contents.rst`` between the event-bus and orchestrate sections.
The CI onedir / source-package builds (run 25888742458) failed on every
matrix slot that pins Python 3.14: ``grpcio`` 1.80.0 has no manylinux
wheel for 3.14, so pip falls back to a source build inside the
salt-onedir build environment and the bundled BoringSSL ASM trips the
toolchain's ``-Werror=implicit-function-declaration``.  ``grpcio`` was
pulled in transitively via ``opentelemetry-exporter-otlp-proto-grpc``,
which we had listed in ``requirements/base.txt``.

Remove the gRPC exporter from base requirements and keep only
``opentelemetry-exporter-otlp-proto-http``, which is pure Python, has
no native deps, and ships wheels for every supported interpreter.  Most
collectors (Jaeger included) ingest OTLP/HTTP natively, so this is a
near-zero-cost UX change.

The gRPC exporter is still selectable: users can install
``opentelemetry-exporter-otlp-proto-grpc`` themselves and set
``tracing.exporter: otlp-grpc``.  The import now lives inside
``_build_exporter`` as a lazy ``try/except ImportError`` so a missing
optional package degrades to a logged error instead of breaking module
import or daemon startup.

Default exporter switched to ``otlp-http`` in both
``DEFAULT_MASTER_OPTS`` and ``DEFAULT_MINION_OPTS``.  Docs and example
Jaeger demo updated (port 4318/v1/traces).  New unit test covers the
graceful-when-grpc-missing path.  Static lockfiles regenerated by
pre-commit to drop ``grpcio`` and the gRPC exporter package.
CI run 25901418243 surfaced 39 failures spread across three buckets
(Windows MSI/NSIS upgrade-downgrade tests, unit zeromq shard 4, and
integration zeromq/tcp shard 1).  All of them share the same backtrace:

    File ".../salt/utils/tracing.py", line 46, in <module>
        from opentelemetry import context as otel_context
    ModuleNotFoundError: No module named 'opentelemetry'

The hard import at module top broke three real environments:

* the salt-ssh **thin tarball**, which by design only ships salt's own
  source — no third-party dependencies follow it to the remote host;
* **older installed onedirs** (3007.14 / 3008.0rc3) whose bundled Python
  does not yet contain opentelemetry, exercised by the upgrade /
  downgrade test matrix;
* **test_thin_dir** which builds the same thin tarball locally and
  invokes it via the host Python.

Wrap every opentelemetry import in ``try / except ImportError`` and
guard module state with an ``_OTEL_AVAILABLE`` flag.  When the package
is missing:

* ``is_enabled()`` always returns False, regardless of opts.
* ``start_span(...)`` returns the singleton ``_NOOP_SPAN``.
* ``inject()`` / ``extract()`` short-circuit.
* ``configure()`` logs one warning if a user has ``enabled: true`` but
  opentelemetry is absent, then becomes a no-op.
* ``SpanKind`` is a duck-typed stub (``SpanKind.SERVER == "SERVER"``)
  so the call sites that pass it through continue to work.
* ``_NoopSpan.get_span_context()`` returns a duck-typed invalid context.

The "tracing is fully opt-in, never affects performance unless
configured" contract is preserved.

A subprocess-isolated unit test
(``test_module_works_when_opentelemetry_missing``) installs a
meta_path finder that raises ImportError for any opentelemetry import,
re-imports ``salt.utils.tracing``, and asserts every public API stays
silent.  Subprocess isolation is necessary because reloading the
module in-process would leave other modules holding references to the
otel-blocked variant.
The rebase onto ``origin/3008.x`` resolved its conflicts with
``-X theirs``, which preserved my local view of the lockfiles instead
of letting pip-compile reconcile them with the upstream urllib3==2.7.0
bump.  CI's Pre-Commit job flagged every Py3.13 and Py3.14 packaging /
CI / lint requirements file as needing regeneration.

Re-run pip-compile via pre-commit locally; the resulting files agree
with both the new urllib3 pin and the trimmed opentelemetry-without-grpc
top-level set, so the Pre-Commit job will now pass.
``Build Source Packages / DEB (arm64)`` consistently fails on CI run
25947204634 (and the prior 25941045430) at the ``pip install`` step
inside ``debuild`` with::

    /tmp/ccuWvGgq.s:653: Error: unknown mnemonic `bti' -- `bti jc'
    ...
    Failed to build protobuf

``protobuf`` is a new transitive dependency introduced by
``opentelemetry-exporter-otlp-proto-http``
(via ``opentelemetry-proto`` → ``protobuf<7.0,>=5.0``).  The onedir
install in :mod:`tools.pkg.build` passes ``--no-binary=:all:`` plus an
explicit ``--only-binary`` allowlist, and ``protobuf`` isn't on that
allowlist — so pip falls back to building the C/C++ extension from
source.  On Python 3.14 aarch64 the bundled BoringSSL assembly uses the
ARMv8.5 ``bti`` instruction; the GAS shipped with the relenv aarch64
toolchain is too old to recognise that mnemonic, and the build fails.

``protobuf`` 6.33.x ships a ``cp39-abi3-manylinux2014_aarch64.whl``
(stable-ABI wheel that works on every supported interpreter including
3.14), so allowing the binary wheel resolves the failure cleanly without
toolchain or runtime changes.  x86_64 source builds happen to succeed
today but go through the same fragile ASM path, so adding the package
to the allowlist also reduces blast radius there.

Add ``protobuf`` to both the cross-platform and the Linux/macOS
``--only-binary`` lists in :func:`onedir_dependencies`.
The rebase onto current ``origin/3008.x`` used ``-X theirs`` to auto-
resolve conflicts in the static requirements lockfiles (origin bumped
``certifi``, ``croniter`` and a handful of other transitive pins on
top of the earlier urllib3 update).  That preserves my view of the
lockfiles instead of letting ``pip-compile`` reconcile them, so the
Pre-Commit CI hook would flag drift.

Re-run pip-compile via pre-commit locally; the resulting files agree
with both the new upstream pins and the trimmed opentelemetry-without-
grpc top-level set, so the Pre-Commit job stays green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:full Run the full test suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants